Gemini Multimodal Large Language Model Notes, Day 7 - Understanding the Gemini API's Document Processing Features

16th Ironman Contest

kevin_chiu

2024-09-17 23:28:24

Introduction

   All code in this post runs in Colab. If you run it in another environment, adjust it accordingly.

Colab setup: configure the project and API key
Load Gemini

#pip install -q -U google-generativeai
import google.generativeai as genai

API key

from google.colab import userdata
API_KEY=userdata.get('GOOGLE_API_KEY')

#genai.configure(api_key="YOUR_API_KEY")

# Configure the client library by providing your API key.
genai.configure(api_key=API_KEY)
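
For readers running outside Colab, the key lookup could fall back to an environment variable. The following is a minimal sketch, assuming the key is stored in an environment variable named `GOOGLE_API_KEY`; the helper name `load_api_key` is hypothetical and not part of the library.

```python
import os


def load_api_key(name="GOOGLE_API_KEY"):
    """Return the API key from Colab secrets if available, else from the environment."""
    try:
        # google.colab is only importable inside a Colab runtime.
        from google.colab import userdata
    except ImportError:
        userdata = None

    if userdata is not None:
        return userdata.get(name)

    # Assumption: outside Colab the key lives in an environment variable.
    key = os.environ.get(name)
    if not key:
        raise RuntimeError(f"Set the {name} environment variable before running.")
    return key
```

With this helper, the configuration line would become `genai.configure(api_key=load_api_key())`.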

Google Drive access

from google.colab import drive
drive.mount('/content/drive')

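Understanding the Gemini API's document processing features

The Gemini API can process PDF documents passed to it and run inference over them. Once a PDF file is uploaded, the Gemini API can:

Describe the content or answer questions about it
Summarize the content
Extrapolate from the content

Upload a document with the File API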

!curl -o gemini.pdf https://storage.googleapis.com/cloud-samples-data/generative-ai/pdf/2403.05530.pdf

# Upload the file and print a confirmation
sample_file = genai.upload_file(path="gemini.pdf",
                                display_name="Gemini 1.5 PDF")

print(f"Uploaded file '{sample_file.display_name}' as: {sample_file.uri}")

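Verify the PDF upload and retrieve its metadata

file = genai.get_file(name=sample_file.name)
# Print the URI of the retrieved file (not the original upload handle) to confirm the lookup.
print(f"Retrieved file '{file.display_name}' as: {file.uri}")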
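
To look at more of the metadata than the display name and URI, the retrieved object could be flattened into a dictionary. This is a sketch only: `describe_file` is a hypothetical helper, and the field names `mime_type` and `size_bytes` are assumptions about the attributes exposed by the library's file object, so the helper falls back to `None` for any that are missing.

```python
def describe_file(file):
    """Collect a few metadata fields from an uploaded file object into a dict."""
    # Assumed attribute names; absent ones are reported as None.
    fields = ("name", "display_name", "mime_type", "size_bytes", "uri")
    return {field: getattr(file, field, None) for field in fields}
```

It could be called as `print(describe_file(genai.get_file(name=sample_file.name)))`.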

Prompt the Gemini API with the uploaded document

# Choose a Gemini model.
model = genai.GenerativeModel(model_name="gemini-1.5-flash")

# Prompt the model with the previously uploaded PDF and a text instruction.
# The prompt asks: "Can you summarize this document as a bulleted list?"
response = model.generate_content([sample_file, "您能否將這份文件總結為項目符號清單？"])

print(response.text)

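Response (translated from Chinese)

Of course. Here is a summary of the document:

* **Introduction**. The paper introduces Gemini 1.5 Pro, a new multimodal mixture-of-experts model developed by Google DeepMind that can handle contexts of more than 10 million tokens.
* **Model architecture**. Gemini 1.5 Pro is based on a sparse mixture-of-experts (MoE) Transformer and builds on Google's past research on MoE models and language models.
* **Training infrastructure and dataset**. Gemini 1.5 Pro was trained on multiple 4096-chip pods of Google TPUv4 accelerators distributed across multiple data centers, using multimodal and multilingual data.
* **Long-context evaluation**. Given the model's long-context capabilities, the researchers ran large-scale quantitative evaluations, including perplexity over long sequences and the "needle in a haystack" task.
* **Realistic long-context evaluation**. The researchers evaluated the model on new tasks that require longer, more realistic contexts, such as learning to translate a new language from a single book.
* **Core capability evaluation**. The researchers evaluated the model's core capabilities outside the long-context setting, such as math, science and reasoning, coding, multilinguality, and instruction following.
* **Responsible deployment**. The researchers follow a structured approach to considering ethics and safety when deploying Gemini models.
* **Discussion**. The paper discusses the challenges of long-context evaluation and the importance of using more advanced long-context models in future development.

Hope this summary helps!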
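
The `print(response.text)` call in the snippet before this response assumes the model returned text. A defensive variant might look like the sketch below; `safe_text` is a hypothetical helper, and it rests on the assumption that the `text` accessor raises `ValueError` when the response carries no usable text part (for example, when a response is blocked).

```python
def safe_text(response, fallback="(no text returned)"):
    """Return response.text, or a fallback string if no text is available."""
    try:
        return response.text
    except ValueError:
        # Assumption: the accessor raises ValueError when there is no text part.
        return fallback
```

It would replace the direct access as `print(safe_text(response))`.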

Upload one or more locally stored files

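#pip install -q PyPDF2
import PyPDF2

def extract_text_from_pdf(pdf_path):
    """Return the concatenated text of every page in the PDF at pdf_path."""
    with open(pdf_path, 'rb') as pdf_file:
        pdf_reader = PyPDF2.PdfReader(pdf_file)
        extracted_text = ""
        for page in pdf_reader.pages:
            text = page.extract_text()
            # Skip pages with no extractable text (for example, scanned images).
            if text:
                extracted_text += text
        return extracted_text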

sample_file_2 = extract_text_from_pdf('/content/drive/MyDrive/OWASP_Top_10-2017_(en).pdf.pdf')
#sample_file_3 = extract_text_from_pdf('example-2.pdf')
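
Note that the snippet above sends extracted text rather than the PDF itself. As an alternative, local PDFs could go through the File API in the same way as the earlier sample. The sketch below is one possible shape for that: `upload_local_pdfs` is a hypothetical helper, and the upload function is passed in as a parameter (for example `genai.upload_file`) so the loop itself stays independent of the network.

```python
from pathlib import Path


def upload_local_pdfs(paths, upload):
    """Upload each existing path with `upload` and return the results in order."""
    uploaded = []
    for path in map(Path, paths):
        if not path.is_file():
            # Skip missing files rather than failing the whole batch.
            print(f"Skipping missing file: {path}")
            continue
        # The file name without its extension serves as the display name.
        uploaded.append(upload(path=str(path), display_name=path.stem))
    return uploaded
```

A call might look like `files = upload_local_pdfs(['example-1.pdf', 'example-2.pdf'], genai.upload_file)`, where both file names are placeholders.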

Prompt with multiple documents

# Choose a Gemini model.
model = genai.GenerativeModel(model_name="gemini-1.5-flash")

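# The prompt asks: "Summarize the claims of each of these two papers in about
# 100 words each; please answer in Traditional Chinese."
prompt = "各用100字總結列出這兩份的論文的陳述,請用繁體中文回答"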

response = model.generate_content([prompt, sample_file, sample_file_2])

print(response.text)

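Response (translated from Chinese)

## Summary of the two papers' claims:

**First paper:**

**OWASP Top 10 - 2017** highlights the ten most critical web application security risks: injection, broken authentication, sensitive data exposure, XML external entities (XXE), broken access control, security misconfiguration, cross-site scripting (XSS), insecure deserialization, using components with known vulnerabilities, and insufficient logging and monitoring. The document draws on data submitted by more than 40 application security firms and an industry survey of more than 500 people. It aims to raise awareness of these risks among developers and organizations, and offers preventive measures and recommended next steps.

**Second paper:**

**Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context** introduces the latest multimodal model, Gemini 1.5 Pro. The model uses a mixture-of-experts architecture and can understand contexts on the scale of ten million tokens, including long documents, long videos, and long audio. The paper demonstrates Gemini 1.5 Pro's strong long-context performance, for example in long-document question answering, long-video question answering, and long-context speech recognition. The model also shows the ability to learn a new language from a single set of linguistic materials, for example learning to translate English into Kalamang, a language with only 200 speakers, from a grammar manual alone.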
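
In the response above, the model's "first" and "second" do not follow the order in which the documents were passed. One way to make each document easier to refer to is to place a short text label in front of it. The sketch below shows the idea; `build_labeled_contents` is a hypothetical helper, and the titles in the usage note are illustrative labels rather than anything the API requires.

```python
def build_labeled_contents(prompt, documents):
    """Interleave a text label before each document, then append the prompt."""
    contents = []
    for index, (title, document) in enumerate(documents, start=1):
        # A plain-text label tells the model which document follows.
        contents.append(f"Document {index}: {title}")
        contents.append(document)
    contents.append(prompt)
    return contents
```

It could be used as `model.generate_content(build_labeled_contents(prompt, [("Gemini 1.5 report", sample_file), ("OWASP Top 10 - 2017", sample_file_2)]))`. Whether labeling changes the model's ordering in practice would need to be checked by running it.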